Game Analytics: From Exploratory Data Analysis to Predictive Modeling
Author
Hoang Son Lai
Published
November 17, 2025
Introduction
The modern gaming landscape is fiercely competitive, and player retention and engagement are its ultimate currencies. Success is no longer determined solely by creative design and immersive gameplay but increasingly by the ability to understand and adapt to player behavior. This project, “Game Analytics: From Exploratory Data Analysis to Predictive Modeling,” demonstrates this data-driven paradigm through a comprehensive analysis of Flappy Plane Adventure, a dynamic side-scrolling shooter.
Leveraging a rich dataset of 300 game sessions, this study moves beyond traditional descriptive statistics to uncover the deep-seated patterns that govern player success and failure. My journey begins with a thorough Exploratory Data Analysis (EDA), where I visualize performance distributions, identify the most common obstacles, and engineer advanced behavioral features such as aggressiveness, efficiency, and risk-taking to quantify playstyles.
I then tackle the challenge of a limited dataset through bootstrapping: resampling the training sessions with replacement to enlarge the training set and build more robust machine learning models. This foundation allows me to segment the player base into distinct behavioral profiles using unsupervised learning (K-Means Clustering), revealing clear archetypes from hesitant Beginners to seasoned Experts.
The core of this investigation lies in supervised predictive modeling. I develop and compare multiple algorithms to:
Predict final scores with near-perfect accuracy using a Random Forest regressor.
Forecast player survival beyond a critical 30-second threshold.
Anticipate the cause of a player’s death through a multiclass classification model.
Ultimately, this report transcends a mere technical exercise. Each model and visualization is meticulously interpreted to generate actionable, evidence-based recommendations for game balancing, targeted player engagement, and strategic monetization. My goal is to provide a clear blueprint for how data science can be practically applied to create a more enjoyable, balanced, and commercially successful gaming experience.
1. Data Overview & Processing
The data preparation stage begins by loading the raw game session CSV and converting the start_time and end_time timestamp strings into POSIXct datetime objects. Missing or problematic values are handled (for example, game_duration is set to 0 where missing), and two derived metrics are computed: score_per_second (score divided by duration) and accuracy (UFOs shot divided by bullets fired). Both ratios default to 0 when their denominator is 0.
Code
# Load and clean the data
library(dplyr)

game_data <- read.csv("data/game_sessions.csv", stringsAsFactors = FALSE)

# Data cleaning and preprocessing
game_data_clean <- game_data %>%
  mutate(
    # The trailing "Z" marks UTC, so parse the timestamps in that zone
    start_time = as.POSIXct(start_time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC"),
    end_time = as.POSIXct(end_time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC"),
    death_reason = as.factor(death_reason),
    # Replace missing game_duration with 0
    game_duration = ifelse(is.na(game_duration), 0, game_duration),
    # Create performance metrics (0 when the denominator is 0)
    score_per_second = ifelse(game_duration > 0, score / game_duration, 0),
    accuracy = ifelse(bullets_fired > 0, ufos_shot / bullets_fired, 0)
  ) %>%
  filter(!is.na(start_time))
library(tibble)
library(gt)

variable_description <- tibble(
  Variable = c(
    "id", "start_time", "end_time", "score", "coins_collected", "ufos_shot",
    "bullets_fired", "death_reason", "game_duration", "pipes_passed",
    "score_per_second", "accuracy"
  ),
  Description = c(
    "Unique session identifier",
    "Timestamp when the game session started",
    "Timestamp when the game session ended",
    "Final score achieved in the session. Score = coins_collected + (3 × ufos_shot)",
    "Number of coins collected by the player",
    "Number of UFO enemies shot",
    "Total number of bullets fired",
    "Cause of death (collision type / hazard)",
    "Total session duration in seconds",
    "Number of pipes the player successfully passed",
    "Score normalized by session duration (score ÷ game_duration)",
    "Shooting accuracy (ufos_shot ÷ bullets_fired)"
  ),
  Type = c(
    "Character", "Datetime", "Datetime", "Integer", "Integer", "Integer",
    "Integer", "Categorical", "Numeric", "Integer", "Numeric", "Numeric"
  )
)

variable_description %>%
  gt() %>%
  tab_header(title = md("**Variable Description - Plane Game Analytics**")) %>%
  cols_width(
    Variable ~ px(160),
    Description ~ px(420),
    Type ~ px(120)
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
Table 1: Variable Description - Plane Game Analytics

| Variable | Description | Type |
| --- | --- | --- |
| id | Unique session identifier | Character |
| start_time | Timestamp when the game session started | Datetime |
| end_time | Timestamp when the game session ended | Datetime |
| score | Final score achieved in the session. Score = coins_collected + (3 × ufos_shot) | Integer |
| coins_collected | Number of coins collected by the player | Integer |
| ufos_shot | Number of UFO enemies shot | Integer |
| bullets_fired | Total number of bullets fired | Integer |
| death_reason | Cause of death (collision type / hazard) | Categorical |
| game_duration | Total session duration in seconds | Numeric |
| pipes_passed | Number of pipes the player successfully passed | Integer |
| score_per_second | Score normalized by session duration (score ÷ game_duration) | Numeric |
| accuracy | Shooting accuracy (ufos_shot ÷ bullets_fired) | Numeric |
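The score definition and the two derived ratios in Table 1 can be checked directly against the data. The following is a minimal base R sketch on a small made-up data frame; the `toy_sessions` values are invented for illustration and are not taken from the real dataset.

```r
# Hypothetical toy sessions; column names follow Table 1, values are invented.
toy_sessions <- data.frame(
  coins_collected = c(4, 10, 0),
  ufos_shot       = c(2, 5, 0),
  bullets_fired   = c(8, 20, 0),
  game_duration   = c(12.5, 40, 0),
  score           = c(10, 25, 0)
)

# Recompute the score from its documented components.
toy_sessions$expected_score <-
  toy_sessions$coins_collected + 3 * toy_sessions$ufos_shot

# Guarded ratios: fall back to 0 when the denominator is 0.
toy_sessions$score_per_second <- ifelse(
  toy_sessions$game_duration > 0,
  toy_sessions$score / toy_sessions$game_duration, 0
)
toy_sessions$accuracy <- ifelse(
  toy_sessions$bullets_fired > 0,
  toy_sessions$ufos_shot / toy_sessions$bullets_fired, 0
)

# Rows where the stored score disagrees with the formula.
mismatches <- toy_sessions[toy_sessions$score != toy_sessions$expected_score, ]
print(nrow(mismatches))
```

Running the same comparison on the real sessions would flag any row whose stored score does not follow the documented formula.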
# Survival Timeline by Death Reason
format_bin <- function(x) {
  x <- gsub("\\(", "", x)
  x <- gsub("\\]", "", x)
  x <- gsub("\\[", "", x)
  x <- gsub("\\)", "", x)
  x <- gsub(",", "-", x)
  x
}

game_data_binned <- game_data_clean %>%
  mutate(duration_bin = cut(game_duration, breaks = seq(0, 140, by = 5), include.lowest = TRUE)) %>%
  filter(!is.na(duration_bin)) %>%
  mutate(duration_label = format_bin(as.character(duration_bin))) %>%
  count(duration_label, death_reason, name = "count")

duration_levels <- format_bin(as.character(levels(
  cut(seq(0, 100, by = 5), breaks = seq(0, 140, by = 5), include.lowest = TRUE)
)))
game_data_binned$duration_label <- factor(game_data_binned$duration_label, levels = duration_levels)

timeline_plot <- ggplot(
  game_data_binned,
  aes(
    x = duration_label,
    y = count,
    color = death_reason,
    group = death_reason,
    text = paste0(
      "<b>Death Reason:</b> ", death_reason, "<br>",
      "<b>Duration:</b> ", duration_label, " sec<br>",
      "<b>Count:</b> ", count
    )
  )
) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_line(linewidth = 0.7) +
  geom_point(size = 1.5) +
  labs(
    title = "Survival Timeline by Death Reason",
    x = "Game Duration (seconds)",
    y = "Number of Deaths",
    color = "Death Reason"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = 5)),
    axis.text.y = element_text(margin = margin(r = 5))
  )

ggplotly(timeline_plot, tooltip = "text") %>%
  layout(
    title = list(
      text = "<b>Survival Timeline by Death Reason</b>",
      x = 0.5,
      xanchor = "center",
      font = list(size = 17)
    ),
    legend = list(
      orientation = "h",
      x = 0.5,
      xanchor = "center",
      y = -0.25,
      yanchor = "top"
    ),
    xaxis = list(title_standoff = 20),
    yaxis = list(title_standoff = 20),
    margin = list(b = 160)
  )
Figure 4: Survival Timeline by Death Reason
Code
# Distribution of score by death_reason
stats <- game_data_clean %>%
  group_by(death_reason) %>%
  summarise(
    count = n(),
    mean = mean(score),
    min = min(score),
    q1 = quantile(score, 0.25),
    median = median(score),
    q3 = quantile(score, 0.75),
    max = max(score)
  )

df <- left_join(game_data_clean, stats, by = "death_reason")

p <- plot_ly()
unique_reasons <- unique(df$death_reason)

for (dr in unique_reasons) {
  dsub <- df %>% filter(death_reason == dr)
  cd <- as.matrix(dsub[, c("count", "mean", "min", "q1", "median", "q3", "max")])
  p <- add_trace(
    p,
    data = dsub,
    x = ~death_reason,
    y = ~score,
    type = "violin",
    name = dr,
    box = list(visible = TRUE),
    meanline = list(visible = TRUE),
    customdata = cd,
    hovertemplate = paste(
      "<b>Death reason:</b> ", dr, "<br>",
      "<b>Score:</b> %{y}<br><br>",
      "<b>Count:</b> %{customdata[0]}<br>",
      "<b>Mean:</b> %{customdata[1]:.2f}<br>",
      "<b>Min:</b> %{customdata[2]}<br>",
      "<b>Q1:</b> %{customdata[3]}<br>",
      "<b>Median:</b> %{customdata[4]}<br>",
      "<b>Q3:</b> %{customdata[5]}<br>",
      "<b>Max:</b> %{customdata[6]}<extra></extra>"
    )
  )
}

p %>%
  layout(
    title = "Score Distribution by Death Reason",
    xaxis = list(title = "Death Reason"),
    yaxis = list(title = "Score")
  )
Figure 5: Score Distribution by Death Reason
Code
# Expected Value of Score Lost per Death Type
ev_loss <- game_data_clean %>%
  group_by(death_reason) %>%
  rename(`Death reason` = death_reason) %>%
  summarise(
    `Mean score` = mean(score),
    `Median score` = median(score),
    `Count of deaths` = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(`Mean score`))

ev_loss %>% kable()
# (A) Score vs Duration with trendline
score_duration_plot <- ggplot(game_data_enhanced, aes(x = game_duration, y = score)) +
  geom_point(alpha = 0.6, color = "#1f77b4") +
  geom_smooth(method = "loess", color = "#ff7f0e", se = TRUE) +
  labs(
    title = "Score vs Game Duration with Trendline",
    x = "Game Duration (seconds)",
    y = "Score"
  ) +
  theme_minimal()

ggplotly(score_duration_plot)
Figure 7: Score vs Game Duration with Trendline
Code
# (B) Bullets vs UFO Shot
efficiency_plot <- ggplot(
  game_data_enhanced,
  aes(x = bullets_fired, y = ufos_shot, color = skill_tier)
) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Bullets Fired vs UFOs Shot",
    x = "Bullets Fired",
    y = "UFOs Shot",
    color = "Skill Tier (by Score)"
  ) +
  theme_minimal()

ggplotly(efficiency_plot)
cat("Holdout Test Size:", nrow(test_holdout), "\n")
Holdout Test Size: 50
Code
# Compare distribution real vs bootstrapped
# linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
compare_plot <- ggplot() +
  geom_density(data = train_base, aes(x = score, color = "Real"), linewidth = 1) +
  geom_density(
    data = train_bootstrapped,
    aes(x = score, color = "Bootstrapped"),
    linewidth = 1,
    alpha = 0.7
  ) +
  labs(
    title = "Score Distribution: Real vs Bootstrapped Data",
    x = "Score",
    y = "Density",
    color = "Data Type"
  ) +
  theme_minimal()

compare_plot
Figure 11: Score Distribution: Real vs Bootstrapped Data
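The bootstrapped training set compared above can be produced by resampling rows with replacement after the holdout rows have been set aside. The sketch below is an assumption about how that step might look, not the report's actual code: the 50-row holdout matches the size printed above, but the seed, the synthetic `sessions` data, and the target of 1,000 resampled rows are illustrative choices.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# Synthetic stand-in for the 300 cleaned sessions (values are invented).
sessions <- data.frame(
  id    = seq_len(300),
  score = rpois(300, lambda = 12)
)

# 1. Set aside the holdout BEFORE resampling, so no holdout row
#    can leak into the bootstrapped training data.
holdout_idx  <- sample(nrow(sessions), size = 50)
test_holdout <- sessions[holdout_idx, ]
train_base   <- sessions[-holdout_idx, ]

# 2. Bootstrap: draw training rows with replacement.
n_boot   <- 1000  # hypothetical target size
boot_idx <- sample(nrow(train_base), size = n_boot, replace = TRUE)
train_bootstrapped <- train_base[boot_idx, ]

# The resampled data contains only sessions that were already in train_base.
cat("Unique sessions in bootstrap:", length(unique(train_bootstrapped$id)), "\n")
cat("Overlap with holdout:",
    length(intersect(train_bootstrapped$id, test_holdout$id)), "\n")
```

Because bootstrapping only repeats existing rows, it adds no new information about players; model quality should still be judged on the untouched holdout.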
4. Segmentation (Unsupervised Learning)
Code
# Features for clustering
cluster_features_enhanced <- train_bootstrapped %>%
  select(
    score, game_duration, coins_collected, bullets_fired, ufos_shot,
    pipes_passed, aggressiveness, efficiency, accuracy, risk_taking
  )
scaled_features_enhanced <- scale(cluster_features_enhanced)

# KMeans clustering with 3 clusters
set.seed(123)
kmeans_enhanced <- kmeans(scaled_features_enhanced, centers = 3, nstart = 25)
train_bootstrapped$cluster_enhanced <- as.factor(kmeans_enhanced$cluster)

# Visualize clusters
fviz_cluster(
  kmeans_enhanced,
  data = scaled_features_enhanced,
  geom = "point",
  ellipse.type = "convex",
  ggtheme = theme_minimal(),
  main = "Enhanced Player Segmentation (K-Means)"
)
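The number of clusters is fixed at three in the code above. One common way to sanity-check such a choice is the elbow method, which compares the total within-cluster sum of squares across candidate values of k. The sketch below uses synthetic two-feature data with three planted groups; it is illustrative only and not part of the original analysis.

```r
set.seed(123)

# Synthetic data: three well-separated groups in two features (invented values).
toy <- rbind(
  cbind(rnorm(50, mean = 0),  rnorm(50, mean = 0)),
  cbind(rnorm(50, mean = 6),  rnorm(50, mean = 6)),
  cbind(rnorm(50, mean = 12), rnorm(50, mean = 0))
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6.
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which extra clusters reduce WSS only marginally.
print(data.frame(k = k_values, wss = round(wss, 1)))
```

On the real feature matrix, the same loop would run over `scaled_features_enhanced`; a clear bend at k = 3 would support the three-segment interpretation.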
ceiling  enemy_bullet  ground  pipe   ufo_collision
0.333    0.250         0.667   0.840  NaN
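The NaN for ufo_collision in the output above is what a per-class ratio produces when its denominator is zero, for example precision for a class the model never predicts. The source does not state which metric these values are, so the sketch below assumes per-class precision and uses invented labels purely to show how such a NaN arises.

```r
# Invented true and predicted labels; class names follow the death_reason levels.
classes   <- c("ceiling", "enemy_bullet", "ground", "pipe", "ufo_collision")
actual    <- factor(
  c("pipe", "pipe", "ground", "ceiling", "ufo_collision", "enemy_bullet"),
  levels = classes
)
predicted <- factor(
  c("pipe", "pipe", "ground", "pipe", "pipe", "enemy_bullet"),
  levels = classes
)

# Confusion matrix: rows = predicted class, columns = actual class.
conf_mat <- table(Predicted = predicted, Actual = actual)

# Per-class precision = correct predictions / all predictions of that class.
# A class that is never predicted gives 0 / 0, which R reports as NaN.
precision <- diag(conf_mat) / rowSums(conf_mat)
print(round(precision, 3))
```

Reporting the number of cases per class alongside each metric makes such undefined values easier to interpret.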
6. Business Insights & Recommendations
Based on the analysis above, I derive the following actionable insights:
6.1. Difficulty Balancing:
Observation: The death_reason analysis highlights the most common obstacles (e.g., pipes vs. enemies). If ‘pipe’ collisions are disproportionately high early in the game, the initial difficulty curve may be too steep.
Recommendation: Adjust the gap size or spawn rate of the leading cause of death in the first 10 seconds of gameplay to improve retention.
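The check described in 6.1 amounts to a frequency table of death reasons restricted to short sessions. A minimal sketch, assuming the death_reason and game_duration columns from Table 1 and the 10-second window named in the recommendation; the `toy` data is invented.

```r
# Invented sessions for illustration only.
toy <- data.frame(
  game_duration = c(3, 7, 9, 4, 25, 40, 8, 60),
  death_reason  = c("pipe", "pipe", "ground", "pipe", "ufo_collision",
                    "enemy_bullet", "ceiling", "pipe")
)

early_window <- 10  # seconds; threshold taken from the recommendation

# Share of each death reason among sessions that ended within the window.
early       <- toy[toy$game_duration <= early_window, ]
early_share <- prop.table(table(early$death_reason))

# Same shares across all sessions, for comparison.
overall_share <- prop.table(table(toy$death_reason))

print(round(sort(early_share, decreasing = TRUE), 2))
print(round(sort(overall_share, decreasing = TRUE), 2))
```

A death reason whose early share is clearly larger than its overall share is a candidate for easing in the opening seconds.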
6.2. Player Segmentation Strategy:
Observation: K-Means clustering identified distinct groups (see the cluster table), for example high-duration, low-coin collectors versus aggressive shooters.
Recommendation: Introduce targeted rewards.
For ‘Survivors’ (High duration, low action): Introduce time-based achievements.
For ‘Shooters’ (High bullets/UFOs): Offer weapon skins or visual upgrades for combat milestones.
6.3. Predictive Engagement:
Observation: The Random Forest model shows that specific actions (like coins_collected or ufos_shot) are strong predictors of high scores.
Recommendation: Create a tutorial or “Daily Mission” focusing on these high-value actions to teach new players how to achieve higher scores effectively.
6.4. Monetization Opportunities:
Observation: Players who survive past the 30-second threshold (analyzed in the Logistic Regression) show higher engagement.
Recommendation: Trigger “Continue?” ads or special offers only after a player has demonstrated this “expert” survival trait, as they are more invested in the session than a player who dies instantly.
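The 30-second survival target referred to in 6.4 is a binary outcome, which is why logistic regression fits it. The sketch below shows the general shape of such a model in base R; the single predictor (accuracy), the synthetic data, and the seed are assumptions made for illustration, and the report's actual feature set is not shown here.

```r
set.seed(1)
n <- 200

# Invented shooting accuracy in [0, 1].
accuracy <- runif(n)
# Invented durations that tend to grow with accuracy, plus noise.
game_duration <- pmax(0, 10 + 40 * accuracy + rnorm(n, sd = 15))

toy <- data.frame(
  accuracy     = accuracy,
  survived_30s = as.integer(game_duration > 30)  # binary target at 30 seconds
)

# Logistic regression: log-odds of surviving past 30 s as a linear
# function of accuracy.
fit <- glm(survived_30s ~ accuracy, data = toy, family = binomial())

# Predicted survival probability for a low- and a high-accuracy player.
new_players <- data.frame(accuracy = c(0.2, 0.8))
p_survive   <- predict(fit, newdata = new_players, type = "response")
print(round(p_survive, 2))
```

A predicted probability like this could serve as the trigger condition for the offers described above, provided the predictors are available before the 30-second mark.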